POLI 572B

Michael Weaver

February 28, 2022

Multivariate Least Squares

Objectives

Recap

  • Mathematical properties of Multivariate Least Squares

Interpretation

  • Best Linear Approximation of CEF
  • Dummy Variables
  • Slopes and Intercepts

Today

  • Least squares as an algorithm with mathematical properties
    • what is the algorithm doing?
    • what are mathematical facts about the algorithm?
  • How does the algorithm relate back to the conditional expectation function?
  • Interpreting simple scenarios:
    • dummy variables
    • slopes and intercepts

https://pastebin.com/Y8AmJK7C

Recap

Multivariate Least Squares:

A matrix equation fits the mean of \(Y\) as a linear function of many variables:

\[Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_k X_{ki}\]

\(\hat{Y_i}\) is the predicted mean of \(Y\) given values of \(\mathbf{X}_i\)

Least Squares

We obtain the vector of coefficients \(\beta = (b_0, b_1, \dots, b_k)\):

\[\beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]

This is the matrix formula for least squares regression.

If \(X\) is a column vector of \(1\)s, \(\beta\) is just the mean of \(Y\). (We just did this)

If \(X\) contains a column of \(1\)s and a column of values of \(X_1\), \(\beta\) contains an intercept and a slope on \(X_1\).

\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}\]

\[\mathbf{X}\beta = \begin{pmatrix} 1 & x_{11} & x_{21} \\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix} = \hat{Y}\]
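A minimal sketch of this computation in R, using hypothetical toy data:

# Toy design matrix: intercept column plus one regressor
X <- cbind(1, c(2, 4, 6, 8))
Y <- c(3, 7, 8, 14)

# (X'X)^{-1} X'Y
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
beta  # b_0 (intercept) and b_1 (slope)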

Mathematical Requirements:

  1. Matrix \(X\) has “full rank”
  • This means that all of the columns of \(X\) are linearly independent.
    • cannot have two identical columns
    • cannot have a set of columns that sums to another column multiplied by a scalar
  • If \(X\) is not full rank, \(\mathbf{X}'\mathbf{X}\) cannot be inverted, and we cannot do least squares.
  2. \(n \geq p\): we need at least as many data points as variables in our equation
  • no longer trivial with multiple regression

Key facts about regression:

The mathematical procedures we use in regression ensure that:

\(1\). The mean of the residuals is always zero: \(\overline{e} = 0\). Because we included an intercept (\(a\) or \(b_0\)), the regression line goes through the point of averages. This is also true of the residuals from the mean.

\(2\). \(Cov(X,e) = 0\). This is true by definition of how we derived least squares.

\(3\). \(\overline{Y} = \overline{X}'\beta\): the predicted value \(\hat{Y}\) when all variables in \(X\) are at their mean is the mean of \(Y\).

\(4\). Slope \(b_k\) captures change in \(Y\) due to variation in \(X_k\) that is orthogonal/uncorrelated to all other columns of \(X\).

Multivariate Least Squares:

\[Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k\]

\(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\)

where \(X_k^* = X_k - \hat{X_k}\), with \(\hat{X_k}\) obtained from the regression:

\(X_{k} = c_0 + c_1 X_{1} + \ldots + c_{j} X_{j}\) over all \(j \neq k\)

\(X_k^*\) is the residual from regressing \(X_k\) on all other \(X_{j \neq k}\)

Multivariate Least Squares:

How do we make sense of \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\), where \(X_k^*\) is the residual of \(X_k\) after regressing it on all other \(X_{j \neq k}\)?

  • The slope \(b_k\) is the change in \(Y\) with a one-unit change in the part of \(X_k\) that is uncorrelated with/orthogonal to the other variables in the regression.
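A sketch of this residual-regression logic in R, assuming a hypothetical data frame dat with outcome Y and regressors X1 and X2:

# Residualize X1 on the other regressor(s)
x1_star <- residuals(lm(X1 ~ X2, data = dat))

# Slope on X1 from the full regression...
coef(lm(Y ~ X1 + X2, data = dat))["X1"]

# ...equals Cov(X1*, Y) / Var(X1*)
cov(x1_star, dat$Y) / var(x1_star)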

The CEF

The CEF

\[E(Y_i | X_i) = f(X_i)\]

Conditional: mean of \(Y\) is different at different levels of \(X\)

Expectation: expected value/mean of \(Y\)

Function: a mathematical function links values of \(X\) to the mean of \(Y\). It could be, but need not be, linear

Regression and CEF

How does multivariate regression relate back to Conditional Expectation Function?

  • Angrist and Pischke give justification
  • least squares is the best linear approximation of the conditional expectation function \(E(Y_i | X_i)\)

Least Squares and CEF

Least Squares predictions \(\widehat{Y} = X'\beta\) give the linear approximation of \(E(Y_i | X_i)\) that has the smallest mean-squared error (minimum distance to the true \(E(Y_i | X_i)\)).

  • Least Squares could equivalently fit a line for the conditional mean of \(Y\), \(E(Y | X)\), rather than a prediction for each \(Y_i\): we could use \(E(Y_i | X_i)\) as the dependent variable

Exercise: Download the data. It contains mean income for each group of people with the same AGE (in years)

https://pastebin.com/Y8AmJK7C

Exercise:

  • Plot INCEARN by AGE

Use matrix calculations in R to obtain the linear approximation of \(E(Y_i | X_i)\):

  • What is the intercept? What does it mean?
  • What is the slope? What does it mean?
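One way to set up the matrix calculation, as a sketch assuming the downloaded data frame is called dat with columns AGE and INCEARN:

X <- cbind(1, dat$AGE)             # intercept column and AGE
Y <- dat$INCEARN
solve(t(X) %*% X) %*% t(X) %*% Y   # intercept and slope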
##           [,1]
## [1,] 17501.992
## [2,]  3668.629

Do your estimates agree? Why or why not?

We are not accounting for the number of people in each value of \(X_i\)

##           [,1]
## [1,] 17501.992
## [2,]  3668.629

Least Squares and CEF

We can reproduce any individual-level least squares by:

  1. Collapsing the data to the unique values of \(X_i\), forming groups \(X_g\) for \(g \in G\)
  2. Taking the group mean of \(Y_i\): \(\overline{Y}_g\)
  3. Regressing \(\overline{Y}_g\) on \(X_g\), weighting by group size \(n_g\).

We are directly modelling the conditional expectation \(E(Y_i | X_i)\) as a linear function.
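A sketch of this equivalence in R, assuming individual-level data dat with outcome Y and regressor X:

# Individual-level least squares
coef(lm(Y ~ X, data = dat))

# Collapse to group means of Y at each unique X, with group sizes n_g
grouped <- aggregate(Y ~ X, data = dat, FUN = mean)
grouped$n_g <- as.vector(table(dat$X))  # table() orders groups the same way as aggregate()

# Weighted group-level regression reproduces the individual-level coefficients
coef(lm(Y ~ X, data = grouped, weights = n_g))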

Least Squares and CEF

Least Squares estimates the mean of \(Y\) as a function of values of \(X\).

  • Conditional Expectation Function
  • \(\hat{Y_i} | X_i\) is best linear approximation of \(E(Y | X = X_i)\)
    • best according to “minimum Euclidean distance”

Dummy Variables

Dummy Variables

What are they?

  • Binary variables taking \(1\) if the observation has some attribute, \(0\) otherwise.

  • Can be used for Yes/No categorical variables
  • Can be used for variables with many mutually exclusive categories (e.g. race, marital status, etc.)

How do we make them?

  • Easy to do in R with as.factor(): each unique value will get a dummy.

lm(y ~ as.factor(any_variable))

Dummy Variables

\(Earnings_i = b_0 + b_1 \ Female_i\)

assuming, as the survey does, the gender binary

  1. What does the design matrix \(X\) in this regression look like?
  2. What does the \(b_0\) tell us?
  3. What does the \(b_1\) tell us?

Dummy Variables

\(Earnings_i = b_0 + b_1 \ Female_i + b_2 \ Male_i\)

  1. What does the design matrix \(X\) in this regression look like?
  2. What does the \(b_0\) tell us?
  3. What does the \(b_1\) tell us?
  4. What does the \(b_2\) tell us?

Dummy Variables

\(Earnings_i = b_0 \ Female_i + b_1 \ Male_i\)

  1. What does the design matrix \(X\) in this regression look like?
  2. What does the \(b_0\) tell us?
  3. What does the \(b_1\) tell us?
Design matrix (first rows) and coefficients with an intercept and FEMALE:

##   (Intercept) FEMALE
## 1           1      0
## 2           1      0
## 3           1      0
## 4           1      1
## 5           1      0
## 6           1      1
## (Intercept)      FEMALE 
##   201941.70   -57269.02

With an intercept, FEMALE, and MALE (the linearly dependent MALE column gets an NA coefficient):

##   (Intercept) FEMALE MALE
## 1           1      0    1
## 2           1      0    1
## 3           1      0    1
## 4           1      1    0
## 5           1      0    1
## 6           1      1    0
## (Intercept)      FEMALE        MALE 
##   201941.70   -57269.02          NA

With FEMALE and MALE but no intercept (each coefficient is a group mean):

##   FEMALE MALE
## 1      0    1
## 2      0    1
## 3      0    1
## 4      1    0
## 5      0    1
## 6      1    0
##   FEMALE     MALE 
## 144672.7 201941.7

Dummy Variables

If there is an intercept:

  • one category must be excluded (the “reference group”); otherwise the columns are linearly dependent. (R will do this by default)
  • each coefficient is the difference in means between the group indicated by the dummy and the reference group, whose mean is the intercept

If there is no intercept:

  • all categories must be included
  • separate intercepts for each group: coefficient is the mean of the group indicated by the dummy (when all other variables are \(0\)).

Exercise:

How did different regions in the UK vote on Brexit?

  • Download the brexit data: https://pastebin.com/Y8AmJK7C
  • use the table function to find out what the different Region names are
  • use the hist function to visually examine the distribution of the leave vote (Pct_Lev)

Exercise:

Using the lm function: (lm(y ~ x, data = data))

  1. Regress Pct_Lev on Region
  2. Interpret the coefficients. (summary(your_model))
  3. (Without looking at them) What would the predicted values \(\widehat{Y}\) from this regression tell us?
  4. How should we interpret the residuals \(e\) (or \(Y^*\))?
  5. Regress Pct_Lev on -1 + Region
  6. Interpret the coefficients.
  7. (Without looking at them) Are the predicted values \(\widehat{Y}\) from this regression similar to those from (1)?
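A sketch of the two specifications, assuming the downloaded data frame is called brexit:

m1 <- lm(Pct_Lev ~ Region, data = brexit)       # with intercept: one region is the reference
m2 <- lm(Pct_Lev ~ -1 + Region, data = brexit)  # no intercept: one coefficient (a mean) per region
summary(m1)
summary(m2)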

Example:

## 
##                     East            East Midlands                   London 
##                       47                       40                       33 
##               North East               North West         Northern Ireland 
##                       12                       39                        1 
##                 Scotland               South East               South West 
##                       32                       67                       37 
##                    Wales            West Midlands Yorkshire and The Humber 
##                       22                       30                       21

Dummy Variables

Model 1 (estimates with standard errors in parentheses):
(Intercept) 39.137 (1.394)***
RegionULondon −0.045 (1.956)
RegionUNorth West 16.778 (1.881)***
RegionUYorkshire and The Humber 19.513 (2.215)***
RegionUNorth East 20.342 (2.669)***
RegionUWest Midlands 21.178 (2.004)***
RegionUEast Midlands 20.438 (1.870)***
RegionUSouth West 14.547 (1.904)***
RegionUEast 17.825 (1.807)***
RegionUSouth East 13.033 (1.695)***
RegionUWales 14.211 (2.184)***
RegionUNorthern Ireland 5.083 (8.008)
Num.Obs. 381
R2 0.443
R2 Adj. 0.426
AIC 2668.6
BIC 2719.9
Log.Lik. −1321.302
F 26.679

How is this different from what you produced?

Dummy Variables

During the Brexit vote, many areas thought to support “Remain” experienced heavy rainfall. total_precip_prepolling is the number of mm of rain that fell just before polls opened.

  1. Regress Pct_Lev on total_precip_prepolling + Region.
  2. Thinking about the least squares estimator: how is the slope on total_precip_prepolling calculated in this case? (hint: what is \(total \ precip \ prepolling^*\)?)
  3. In light of \(2\), interpret the slope on total_precip_prepolling as precisely as possible.
Model 1 (estimates with standard errors in parentheses):
(Intercept) 52.558 (1.496)***
total_precip_prepolling 0.052 (0.012)***
RegionEast Midlands 7.012 (1.927)***
RegionLondon −19.266 (1.775)***
RegionNorth East 6.921 (2.677)*
RegionNorth West 3.357 (1.938)+
RegionNorthern Ireland −8.338 (7.836)
RegionScotland −13.428 (2.021)***
RegionSouth East −3.112 (1.511)*
RegionSouth West 1.110 (1.957)
RegionWales 0.790 (2.220)
RegionWest Midlands 7.755 (2.052)***
RegionYorkshire and The Humber 6.091 (2.248)**
Num.Obs. 381
R2 0.471
R2 Adj. 0.454
AIC 2650.6
BIC 2705.8
Log.Lik. −1311.311
F 27.354

Exercise:

\(2\): We regress Percent Leave on the residual pre-election precipitation.

The other variables are dummies for region. Thus, \(\widehat{precipitation_i}\) is the mean precipitation in each region.

So, the residual \(precipitation_i^*\) is the difference between precipitation in a constituency and the mean precipitation in its region.

The slope is the relationship between within-region precipitation and Leave vote share.
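A sketch of this decomposition in R, again assuming the data frame is called brexit:

# Residualize precipitation on the region dummies
precip_star <- residuals(lm(total_precip_prepolling ~ Region, data = brexit))

# The multivariate slope equals Cov(precip*, Y) / Var(precip*)
cov(precip_star, brexit$Pct_Lev) / var(precip_star)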

Dummy Variables

You can change the “reference group” like this. This code shifts “Scotland” to the reference by making it the first “level” in the factor.
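A minimal sketch of such code, assuming the data frame is called brexit:

# Make "Scotland" the first level of the factor, so it becomes the reference group
brexit$Region <- relevel(factor(brexit$Region), ref = "Scotland")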

Interpretation

Example

Let’s say we want to get the conditional expectation function for yearly earnings of professionals as a function of hours worked, profession, gender, and age.

Using data from the American Community Survey, we can look at how hours worked is related to earnings for doctors and lawyers.

  • UHRSWORK (the usual hours worked per week)
  • FEMALE (an indicator for female)
  • AGE (in years)
  • LAW (an indicator for being a lawyer)
  • INCEARN (total annual income earned in dollars).

Example

Let’s now find the (approximate) linear conditional expectation function of earnings, across hours worked, age, profession, and gender:
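The fitted model, matching the call shown in the output that follows:

ls <- lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data)
summary(ls)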

Example

## 
## Call:
## lm(formula = INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -304486  -90963  -29238   72845  743690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11083.5     9484.6   1.169    0.243    
## UHRSWORK      1075.5      114.7   9.373   <2e-16 ***
## AGE           3452.8      125.2  27.578   <2e-16 ***
## LAW         -48919.9     2674.6 -18.291   <2e-16 ***
## FEMALE      -37223.1     2799.8 -13.295   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 127800 on 9995 degrees of freedom
## Multiple R-squared:  0.146,  Adjusted R-squared:  0.1457 
## F-statistic: 427.2 on 4 and 9995 DF,  p-value: < 2.2e-16

How do we make sense of, e.g., the slope on FEMALE?

Example

[Plots: linear approximation of the CEF of earnings by age]

Exercise:

Previous slide shows linear approximation of CEF of earnings by age.

In your small groups:

  1. Can you think of a way to use least squares to better approximate the true conditional expectation function of earnings by Age?

  2. How would you do it?

  3. Try it out:

  4. How would you incorporate this into the model below?

ls <- lm(INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data)
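One possible approach, as a sketch rather than the only answer: let the CEF be nonlinear in AGE, for example with a quadratic term:

# Quadratic in AGE lets the earnings-age profile bend
ls2 <- lm(INCEARN ~ UHRSWORK + poly(AGE, 2) + LAW + FEMALE, data = acs_data)
summary(ls2)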